Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures

Author

  • Yuxin WANG
Abstract

This dissertation investigates a method for efficiently building a high-quality homepage collection from the web by considering page group structures. We focus mainly on researchers’ homepages and, to a lesser extent, on homepages of other categories. A web page collection with guaranteed high quality (i.e., high recall and high precision) is required for implementing high-quality web-based information services. Building such a collection, however, demands a large amount of human work because of the diversity, vastness, and sparseness of web pages. Although many researchers have investigated methods for searching and classifying web pages, most of these methods are best-effort and pay no attention to quality assurance. We therefore investigate a method for building a homepage collection efficiently while assuring a given high quality, with the expectation that the method will be applicable to collecting various categories of homepages.

This dissertation consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents related work. Chapter 3 describes the objectives, the overall performance goal of the system, and the scheme of the system. Chapters 4 and 5 discuss in detail the two parts of our two-step processing method. Chapter 6 discusses the method for reducing the processing cost of the system, and Chapter 7 concludes the dissertation by summarizing it and discussing future work.

Chapter 3, taking into account the enormous size of the real web, introduces a two-step processing method comprising rough filtering and accurate classification. The former efficiently narrows down the set of candidate pages while maintaining the required high recall; the latter accurately classifies the candidate pages into three classes (assured positive, assured negative, and uncertain) while assuring the required recall and precision.
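The two-step scheme described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the keyword-based filter, the `scorer` callable, and the thresholds `accept_at` and `reject_at` are all hypothetical stand-ins chosen only to show the control flow of a cheap recall-oriented filter followed by a three-way classification.

```python
from typing import Callable, Dict, Iterable, List

# Class labels as named in the abstract.
POSITIVE = "assured positive"
NEGATIVE = "assured negative"
UNCERTAIN = "uncertain"


def rough_filter(pages: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Step 1 (hypothetical rule): keep every page that mentions any keyword.

    The rule is deliberately loose so that few true homepages are dropped.
    """
    lowered = [k.lower() for k in keywords]
    return [p for p in pages if any(k in p.lower() for k in lowered)]


def three_way_label(score: float, accept_at: float, reject_at: float) -> str:
    """Step 2: map a classifier score to one of the three classes."""
    if score >= accept_at:
        return POSITIVE
    if score <= reject_at:
        return NEGATIVE
    return UNCERTAIN


def build_collection(
    pages: Iterable[str],
    keywords: Iterable[str],
    scorer: Callable[[str], float],  # assumed: returns a score in [0, 1]
    accept_at: float = 0.8,          # assumed threshold, not from the source
    reject_at: float = 0.2,          # assumed threshold, not from the source
) -> Dict[str, List[str]]:
    """Run rough filtering, then sort the surviving candidates into three classes."""
    result: Dict[str, List[str]] = {POSITIVE: [], NEGATIVE: [], UNCERTAIN: []}
    for page in rough_filter(pages, keywords):
        label = three_way_label(scorer(page), accept_at, reject_at)
        result[label].append(page)
    return result
```

In a scheme like this, the two thresholds would be tuned on assessed sample data so that the assured classes meet the required precision and recall, and only the uncertain pages would be left for further handling. That reading is an assumption; the abstract does not state how the classes are used.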

Similar resources

Building web page collections efficiently exploiting local surrounding pages

This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce ...

Designing a Volunteer Geographic Information-based service for rapid earth quake damages estimation

Introduction: The advent of Web 2.0 enables users to interact and to produce free, unlimited real-time data. This advantage leads us to exploit Volunteer Geographic Information (VGI) for real-time crisis management. Traditional estimation methods for earthquake damages are expensive and tim...

Overview of the TREC 2003 Web Track

The TREC 2003 web track consisted of both a non-interactive stream and an interactive stream. Both streams worked with the .GOV test collection. The non-interactive stream continued an investigation into the importance of homepages in Web ranking, via both a Topic Distillation task and a Navigational task. In the topic distillation task, systems were expected to return a list of the homepages o...

Automatic Extraction of Complex Web Data

A new wrapper induction algorithm, WTM, for generating rules that describe the general web page layout template is presented. WTM is mainly designed for use in weblog crawling and indexing systems. Most weblogs are maintained by content management systems and have similar layout structures in all pages. In addition, they provide RSS feeds to describe the latest entries. These entries appear in the...

Hybrid Adaptive Educational Hypermedia Recommender Accommodating User’s Learning Style and Web Page Features

Personalized recommenders have proved to be of use as a solution to reduce the information overload problem. Especially in Adaptive Hypermedia System, a recommender is the main module that delivers suitable learning objects to learners. Recommenders suffer from the cold-start and the sparsity problems. Furthermore, obtaining learner’s preferences is cumbersome. Most studies have only focused...

Publication date: 2006